In this test task you will have the opportunity to demonstrate your Data Science skills from various angles: processing, analyzing and visualizing data, finding insights, applying predictive techniques, and explaining your reasoning.
The task is based around a bike sharing dataset openly available at UCI Machine Learning Repository [1].
Please go through the steps below, build up the necessary code and comment on your choices.
Abstract: This dataset contains the hourly and daily count of rental bikes between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.
Data Set Information: Bike sharing systems are a new generation of traditional bike rentals where the whole process, from membership to rental and return, has become automatic. Through these systems, a user is able to easily rent a bike at one position and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.
Apart from interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.
Attribute Information:
Both hour.csv and day.csv have the following fields, except hr, which is only available in hour.csv:
workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
weathersit :
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp : Normalized temperature in Celsius. The values are derived via (t - t_min)/(t_max - t_min), t_min = -8, t_max = +39 (only in hourly scale)
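Since temp is min-max normalized, the original Celsius value can be recovered by inverting the formula above. A small sketch (denormalize_temp is my own helper name, using the hourly-scale bounds stated in the description):

```python
def denormalize_temp(temp_norm, t_min=-8.0, t_max=39.0):
    """Invert (t - t_min) / (t_max - t_min) to recover degrees Celsius."""
    return temp_norm * (t_max - t_min) + t_min

# a normalized value of 0.5 maps back to the midpoint of the [-8, 39] range
print(denormalize_temp(0.5))  # 15.5
```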
Tasks:
import seaborn as sns
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
# TASK 1.3
day_data = pd.read_csv("day.csv")
hour_data = pd.read_csv("hour.csv")
day_data.tail()
| instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 726 | 727 | 2012-12-27 | 1 | 1 | 12 | 0 | 4 | 1 | 2 | 0.254167 | 0.226642 | 0.652917 | 0.350133 | 247 | 1867 | 2114 |
| 727 | 728 | 2012-12-28 | 1 | 1 | 12 | 0 | 5 | 1 | 2 | 0.253333 | 0.255046 | 0.590000 | 0.155471 | 644 | 2451 | 3095 |
| 728 | 729 | 2012-12-29 | 1 | 1 | 12 | 0 | 6 | 0 | 2 | 0.253333 | 0.242400 | 0.752917 | 0.124383 | 159 | 1182 | 1341 |
| 729 | 730 | 2012-12-30 | 1 | 1 | 12 | 0 | 0 | 0 | 1 | 0.255833 | 0.231700 | 0.483333 | 0.350754 | 364 | 1432 | 1796 |
| 730 | 731 | 2012-12-31 | 1 | 1 | 12 | 0 | 1 | 1 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
day_data.shape
(731, 16)
Split the data into two parts. One dataset containing the last 30 days and one dataset with the rest.
#data for all but the last 30 days
day_data1 = day_data[0:701]
#data for the last 30 days
day_data2 = day_data[701:731]
day_data2.shape
(30, 16)
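The positional split above relies on the rows already being sorted by date; an equivalent, more explicit alternative is to filter on dteday itself. A sketch (split_last_n_days is my own helper; it assumes the date column has been parsed to datetime):

```python
import pandas as pd

def split_last_n_days(df, n=30, date_col="dteday"):
    """Split a daily DataFrame into (all but the last n days, the last n days)."""
    df = df.sort_values(date_col)
    cutoff = df[date_col].max() - pd.Timedelta(days=n - 1)
    head = df[df[date_col] < cutoff]
    tail = df[df[date_col] >= cutoff]
    return head, tail

# usage with toy data: 40 consecutive days ending 2012-12-31
toy = pd.DataFrame({"dteday": pd.date_range("2012-11-22", periods=40, freq="D")})
head, tail = split_last_n_days(toy, n=30)
print(head.shape[0], tail.shape[0])  # 10 30
```

On the toy frame the last-30-day window starts at 2012-12-02, matching the first row of day_data2 above.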
Answers / comments / reasoning:
Tasks:
Task: Determine nmax, the maximum number of bikes that was needed in any one day, and n95, the 95th-percentile number of bikes that was needed in any one day (nmax bicycles would cover 100% of days, n95 covers 95%, etc.).
Task: Perform all needed steps to load and clean the data. Please comment the major steps of your code.
day_data2.head(2)
| instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 701 | 702 | 2012-12-02 | 4 | 1 | 12 | 0 | 0 | 0 | 2 | 0.3475 | 0.359208 | 0.823333 | 0.124379 | 892 | 3757 | 4649 |
| 702 | 703 | 2012-12-03 | 4 | 1 | 12 | 0 | 1 | 1 | 1 | 0.4525 | 0.455796 | 0.767500 | 0.082721 | 555 | 5679 | 6234 |
day_data.isnull().sum() #no null data
instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64
day_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   instant     731 non-null    int64
 1   dteday      731 non-null    object
 2   season      731 non-null    int64
 3   yr          731 non-null    int64
 4   mnth        731 non-null    int64
 5   holiday     731 non-null    int64
 6   weekday     731 non-null    int64
 7   workingday  731 non-null    int64
 8   weathersit  731 non-null    int64
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64
 14  registered  731 non-null    int64
 15  cnt         731 non-null    int64
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB
Only one column, dteday ("date of the day"), is of object type, so we convert it to datetime format. All other columns are already category-encoded, as mentioned in the dataset description.
#Convert dteday to datetime. Working on explicit copies avoids the
#SettingWithCopyWarning raised when assigning to a slice of day_data.
day_data1 = day_data1.copy()
day_data2 = day_data2.copy()
day_data1["dteday"] = pd.to_datetime(day_data1["dteday"])
day_data2["dteday"] = pd.to_datetime(day_data2["dteday"])
#Outliers are calculated only for numerical values
sns.boxplot(data=day_data["cnt"])
<AxesSubplot:>
day_data.duplicated().sum()
0
There are no duplicated instances in our data.
#For the complete dataset
d = day_data.drop(["dteday","instant"],axis=1)
day_data_corr = d.corr()
plt.figure(figsize=(13,10))
ax = sns.heatmap(day_data_corr,annot=True)
plt.show()
From the correlation matrix, strongly correlated pairs of variables can be observed: temp with atemp, season with mnth, and others.
d = day_data.drop(["cnt","casual","registered","dteday","instant"],axis=1)
imp = d.apply(lambda x: x.corr(day_data.cnt))
indices = np.argsort(imp)
print(imp[indices]) #Sorted in ascending order
weathersit   -0.297391
windspeed    -0.234545
hum          -0.100659
holiday      -0.068348
workingday    0.061156
weekday       0.067443
mnth          0.279977
season        0.406100
yr            0.566710
temp          0.627494
atemp         0.631066
dtype: float64
#Columns whose absolute correlation with cnt is below 0.1
print(imp[imp.abs() < 0.1].index.tolist())
['holiday', 'weekday', 'workingday']
Hence, the columns with the lowest correlation with the target variable are holiday, weekday and workingday (|r| ≈ 0.06). All other variables carry absolute correlations greater than 0.2 and therefore have a noticeable impact on the rental bike count.
Answers / comments / reasoning:
As outliers are only meaningful for numerical data, not categorical data, the only numerical columns that could be judged for outliers are the dependent columns casual, registered and cnt, for which we do not need to remove outliers. Further deductions:
There are no duplicated instances in our data.
There is also no need to encode categorical data, as data is already encoded for categorical columns!
Hence, the columns with the lowest correlation with the target variable are holiday, workingday and weekday, with values around 0.06, surprisingly! All other variables carry absolute correlations greater than 0.2 and hence have a noticeable impact on the rental bike count.
From the correlation matrix, strongly correlated pairs among the columns in the dataset can be observed: temp with atemp, season with mnth, and others.
import plotly.express as px
fig = px.line(day_data1, x='dteday', y='cnt', title='Rental Bikes Per Day')
fig.update_xaxes(
rangeslider_visible=True,
rangeselector=dict(
buttons=list([
dict(count=1, label="1m", step="month", stepmode="backward"),
dict(count=6, label="6m", step="month", stepmode="backward"),
dict(count=1, label="YTD", step="year", stepmode="todate"),
dict(count=1, label="1y", step="year", stepmode="backward"),
dict(step="all")
])
)
)
fig.show()
Assume that each bike can serve a maximum of exactly 12 rentals per day.
Number of bikes needed per day, where x is the number of rides that day: $$ \text{bikes}(x) = \left\lceil \frac{x}{12} \right\rceil $$
# Number of bikes hence needed in a day:
print(day_data1.shape)
bikes_req = [math.ceil(c/12) for c in day_data1["cnt"]]
nmax = max(bikes_req)
per95 = np.percentile(bikes_req,95)
nmax,per95
(701, 16)
(727, 632.0)
len(bikes_req),len(day_data1)
(701, 701)
# day_data1["max bikes required"] = bikes_req;
# # day_data1.head(2)
day_data1.shape, day_data1.head()
((701, 16),
instant dteday season yr mnth holiday weekday workingday \
0 1 2011-01-01 1 0 1 0 6 0
1 2 2011-01-02 1 0 1 0 0 0
2 3 2011-01-03 1 0 1 0 1 1
3 4 2011-01-04 1 0 1 0 2 1
4 5 2011-01-05 1 0 1 0 3 1
weathersit temp atemp hum windspeed casual registered \
0 2 0.344167 0.363625 0.805833 0.160446 331 654
1 2 0.363478 0.353739 0.696087 0.248539 131 670
2 1 0.196364 0.189405 0.437273 0.248309 120 1229
3 1 0.200000 0.212122 0.590435 0.160296 108 1454
4 1 0.226957 0.229270 0.436957 0.186900 82 1518
cnt
0 985
1 801
2 1349
3 1562
4 1600 )
# day_data1 = day_data1.sort_values(by="cnt")
# day_data1.head(2)
Hence, the maximum number of rides in any one day and the maximum number of bikes required:
#maximum no. of rides in any day and max bikes required:
day_data1["cnt"].max(), max(bikes_req)
(8714, 727)
Task: Visualize the distribution of the covered days depending on the number of available bicycles (e.g. nmax bicycles would cover 100% of days, n95 covers 95%, etc.)
# percentile = []
# for i in range(0,101):
# percentile.append(np.percentile(bikes_req,i))
# percentile[0:5]
Hence, bikes_req = 727 maps to 100% coverage and the minimum of 2 bikes maps to 0% coverage under this min-max scaling.
#calculating a min-max-scaled percentile rank for the available-bikes data
values = np.array(bikes_req) #no of available bikes each day
min_value = values.min()
max_value = values.max()
percentiles_rank = (values - min_value) / (max_value - min_value) * 100
percentiles_rank[0:5]
array([11.17241379, 8.96551724, 15.31034483, 17.79310345, 18.20689655])
# data = day_data1["dteday"]
# date = pd.to_datetime(data)
# date0 = date[0]
# date0.day,date0.month
import matplotlib.pyplot as plt
day = day_data1["dteday"]
x = day
y = percentiles_rank
fig2 = px.scatter(x=day, y=y, title=' % Covered Days Depending on Bikes Availability ')
fig2.update_xaxes(
rangeslider_visible=True,
rangeselector=dict(
buttons=list([
dict(count=1, label="1m", step="month", stepmode="backward"),
dict(count=6, label="6m", step="month", stepmode="backward"),
dict(count=1, label="YTD", step="year", stepmode="todate"),
dict(count=1, label="1y", step="year", stepmode="backward"),
dict(step="all")
])
)
)
fig2.show()
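Note that the min-max scaling above ranks each day's demand relative to the observed extremes; the fraction of days actually covered by a fleet of n bikes is the empirical CDF of bikes_req. A minimal sketch of that coverage curve (coverage_curve is my own helper, not part of the task code):

```python
import numpy as np

def coverage_curve(bikes_req):
    """For each candidate fleet size n, the share of days whose demand is <= n."""
    demand = np.sort(np.asarray(bikes_req))
    n_bikes = np.arange(demand.max() + 1)
    # searchsorted(side="right") counts the days with demand <= n
    covered = np.searchsorted(demand, n_bikes, side="right") / len(demand)
    return n_bikes, covered

# toy demand: a fleet of 3 bikes fully covers 3 of the 4 days
n_bikes, covered = coverage_curve([1, 2, 3, 5])
print(covered[3])  # 0.75
```

Plotting covered against n_bikes (e.g. with plt.plot) gives the requested visualization; by construction the curve reaches 95% at n95 and 100% at nmax.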
Tasks:
While there are many evaluation metrics for regression model performance, including MSE, MAE, RMSE and R², I will rely mostly on RMSE for the following reasons: it is expressed in the same units as the target variable, and it penalizes large errors more strongly than MAE, which matters when demand is badly under- or over-estimated on individual days.
Though I will rely mostly on RMSE, I will use more than one evaluation metric to assess model quality more robustly. I will also evaluate the model on the training data alongside the test data to detect overfitting, if any.
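Since the same block of metric prints is repeated for every model below, a small helper would keep the comparisons consistent (a sketch; report_metrics is my own name, implemented with plain NumPy rather than sklearn):

```python
import numpy as np

def report_metrics(name, y_true, y_pred):
    """Compute and print RMSE, MSE and MAE for one set of predictions."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = float(np.mean(err ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(err)))
    print(f"{name}: RMSE={rmse:.2f} MSE={mse:.2f} MAE={mae:.2f}")
    return rmse, mse, mae
```

Usage against the models below would be, e.g., report_metrics("model1 test", y_test, pred1).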
#define target variable:
X = day_data1.drop(["cnt","casual","registered","dteday"],axis=1)
y = day_data1["cnt"]
#test-train split
import sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.ensemble import RandomForestRegressor
# create regressor object
model1 = RandomForestRegressor(n_estimators = 50, random_state = 0)
# fit the regressor with x and y data
model1.fit(X_train, y_train)
# Actual class predictions
pred1_train = model1.predict(X_train)
pred1 = model1.predict(X_test)
#define target variable:
X = day_data1.drop(["cnt","casual","registered","dteday","holiday","weekday","workingday"],axis=1)
y = day_data1["cnt"]
#test-train split
import sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.ensemble import RandomForestRegressor
# create regressor object
model2 = RandomForestRegressor(n_estimators = 50, random_state = 0)
# fit the regressor with x and y data
model2.fit(X_train, y_train)
# Actual class predictions
pred2_train = model2.predict(X_train)
pred2 = model2.predict(X_test)
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import math
print("RMSE Train",math.sqrt(mean_squared_error(y_train, pred1_train)))
print("RMSE Test",math.sqrt(mean_squared_error(y_test, pred1)),"\n")
print("MSE train",mean_squared_error(y_train, pred1_train))
print("MSE test",mean_squared_error(y_test, pred1))
print("MAE train",mean_absolute_error(y_train,pred1_train))
print("MAE test",mean_absolute_error(y_test,pred1))
RMSE Train 252.7092149325717
RMSE Test 722.3279923436601

MSE train 63861.947311836724
MSE test 521757.7285232227
MAE train 171.2804489795918
MAE test 471.7901421800948
print("RMSE Train",math.sqrt(mean_squared_error(y_train, pred2_train)))
print("RMSE Test",math.sqrt(mean_squared_error(y_test, pred2)),"\n")
print("MSE train",mean_squared_error(y_train, pred2_train))
print("MSE test",mean_squared_error(y_test, pred2))
print("MAE train",mean_absolute_error(y_train,pred2_train))
print("MAE test",mean_absolute_error(y_test,pred2))
RMSE Train 259.35963901541055
RMSE Test 718.5624209713638

MSE train 67267.42235020408
MSE test 516331.9528322275
MAE train 179.10477551020406
MAE test 495.7623696682465
Answers / comments / reasoning:
Tasks:
# TODO: your code comes here
#define target variable:
X = day_data1.drop(["cnt","casual","registered","dteday","holiday","weekday","workingday"],axis=1)
y = day_data1["cnt"]
#test-train split
import sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.ensemble import RandomForestRegressor
# create regressor object
model3 = RandomForestRegressor(n_estimators = 500,
max_depth = 10,
n_jobs = -1,
verbose = 0,
min_samples_split= 2,
min_samples_leaf=1,
bootstrap = True,
random_state = 0)
# fit the regressor with x and y data
model3.fit(X_train, y_train)
# Actual class predictions
pred3_train = model3.predict(X_train)
pred3 = model3.predict(X_test)
print("RMSE Train",math.sqrt(mean_squared_error(y_train, pred3_train)))
print("RMSE Test",math.sqrt(mean_squared_error(y_test, pred3)),"\n")
print("MSE train",mean_squared_error(y_train, pred3_train))
print("MSE test",mean_squared_error(y_test, pred3))
print("MAE train",mean_absolute_error(y_train,pred3_train))
print("MAE test",mean_absolute_error(y_test,pred3))
RMSE Train 284.66069352750674
RMSE Test 727.2314413703605

MSE train 81031.71043956112
MSE test 528865.569317612
MAE train 206.88636739402696
MAE test 493.53515150415757
# TODO: your code comes here
#define target variable:
X = day_data1.drop(["cnt","casual","registered","dteday"],axis=1)
y = day_data1["cnt"]
#test-train split
import sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.ensemble import RandomForestRegressor
# create regressor object
model4 = RandomForestRegressor(n_estimators = 1000,
max_depth = 20,
n_jobs = -1,
verbose = 0,
min_samples_split= 2,
min_samples_leaf=1,
bootstrap = True,
random_state = 0)
# fit the regressor with x and y data
model4.fit(X_train, y_train)
# Actual class predictions
pred4_train = model4.predict(X_train)
pred4 = model4.predict(X_test)
print("RMSE Train",math.sqrt(mean_squared_error(y_train, pred4_train)))
print("RMSE Test",math.sqrt(mean_squared_error(y_test, pred4)),"\n")
print("MSE train",mean_squared_error(y_train, pred4_train))
print("MSE test",mean_squared_error(y_test, pred4))
print("MAE train",mean_absolute_error(y_train,pred4_train))
print("MAE test",mean_absolute_error(y_test,pred4))
RMSE Train 244.1907180129386
RMSE Test 717.8449844767662

MSE train 59629.106763674485
MSE test 515301.4217384487
MAE train 163.5590916290512
MAE test 460.7849645487214
# TODO: your code comes here
#define target variable:
X = day_data1.drop(["cnt","casual","registered","dteday"],axis=1)
y = day_data1["cnt"]
#test-train split
import sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.ensemble import RandomForestRegressor
# create regressor object
model5 = RandomForestRegressor(n_estimators = 600,
max_depth = 10,
n_jobs = -1,
verbose = 0,
min_samples_split= 2,
min_samples_leaf=2,
bootstrap = False,
random_state = 0)
# fit the regressor with x and y data
model5.fit(X_train, y_train)
# Actual class predictions
pred5_train = model5.predict(X_train)
pred5 = model5.predict(X_test)
print("RMSE Train",math.sqrt(mean_squared_error(y_train, pred5_train)))
print("RMSE Test",math.sqrt(mean_squared_error(y_test, pred5)),"\n")
print("MSE train",mean_squared_error(y_train, pred5_train))
print("MSE test",mean_squared_error(y_test, pred5))
print("MAE train",mean_absolute_error(y_train,pred5_train))
print("MAE test",mean_absolute_error(y_test,pred5))
RMSE Train 272.523403799974
RMSE Test 876.5251609014338

MSE train 74269.00561872368
MSE test 768296.3576932843
MAE train 195.61718011013906
MAE test 596.7793964868717
from sklearn.model_selection import cross_val_score
model6 = RandomForestRegressor(n_estimators = 600,
max_depth = 15,
n_jobs = -1,
verbose = 0,
min_samples_split= 2,
min_samples_leaf=2,
bootstrap = True,
max_features = "auto",
random_state = 0)
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(model6, X, y,
cv=5,
scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)
MAE scores:
 [1098.42860381  430.23286934  593.50551858  932.92524814  784.04907848]
print("Average MAE score (across experiments):")
print(scores.mean())
Average MAE score (across experiments):
767.8282636711754
Without hyperparameter tuning (but removing less-correlated columns):
RMSE Test 718.5624209713638
MSE train 67267.42235020408
Best performer after hyperparameter tuning:
RMSE Test 717.8449844767662
MSE train 59629.106763674485
Answers / comments / reasoning:
Tasks:
Task: Simulate the profit with a fixed number of nmax (from part 2) bicycles for the next 30 days, given that the real observed values are expected to differ from the average predicted values. Calculate the demand by adding the simulated residuals to the expected values calculated from the data you put aside in part 1.
Task 5.1, 5.2: Assume that the revenue per rental is x (your own assumed number). Each bicycle has costs of y per day (your own assumed number).
Task:
Residual = Observed Value - Predicted Value
len(pred4)
211
y_test = y_test.to_list()
#Calculating residual for test data:
#Residual = Observed Value - Predicted Value
residual = []
residual = [round((y_test[i] - pred4[i]),5) for i in range(len(y_test))]
residual,type(residual)
residual[0:5]
#Considering residuals as random shocks that result in the real observed values.
[1202.724, -100.663, 33.997, -1171.346, 61.9078]
#Plotting distribution of Actual residual for test data:
import matplotlib.pyplot as plt
plt.hist(residual)
plt.show()
#calculating mean and std, assuming the residual distribution is Gaussian
import statistics as stats
mean = stats.mean(residual)
std = stats.stdev(residual)
mean,std
(-28.576947440758296, 718.9817155448583)
#Approximation of the Gaussian distribution using the earlier calculated mu and sigma -- the distribution to sample from
#For 1000 samples:
new_dist = np.random.normal(mean,std,1000)
print(new_dist[0:10])
plt.hist(new_dist)
plt.show()
[ 8.10603206 62.97314038 -1956.73960861 1184.9010053 643.73702749 -1689.93025726 -1060.44743039 329.52430811 -32.9609128 372.33731039]
Simulate the profit with a fixed number of nmax (from part 2) bicycles for the next 30 days given that the real observed values are expected to be different from average predicted values. Calculate the demand by adding the simulated residuals to calculated expected values from the data you put aside in part 1.
#Sampling 30 data points from simulated residuals for 30 days demand calculation
new_dist30 = np.random.choice(new_dist,30)
print(new_dist30[0:5])
[-590.95621295 -613.61212603 -605.96451592 266.52348497 -905.01475802]
#Simulate the profit with a fixed number of nmax (from part 2) bicycles for the next 30 days given that the real observed values
#are expected to be different from average predicted values.
#Calculate the demand by adding the simulated residuals to calculated expected values from the data you put aside in part 1.
print("nmax: ",nmax)
#Predicting demand for 30 days:
X30 = day_data2.drop(["cnt","casual","registered","dteday"],axis=1)
y30 = day_data2["cnt"]
y30 = y30.to_list()
pred30 = model4.predict(X30)
print("predicted: ",pred30[0:5])
print("actual: ",y30[0:5])
demand30 = [(new_dist30[i] + pred30[i]) for i in range(len(pred30))] #demand after adding random shock
len(demand30), y30[25:],pred30[25:],demand30[25:]
nmax:  727
predicted:  [4152.548 6173.039 6473.723 6254.256 4536.314]
actual:  [4649, 6234, 6606, 5729, 5375]
(30, [2114, 3095, 1341, 1796, 2729], array([3917.3 , 4515.908, 4152.629, 3455.244, 4478.888]), [2242.0976139490276, 4357.752350709818, 3069.5193033330634, 4423.893114619684, 4501.861896014093])
#Profit for 30 days:
#expenditure for single day: Fixed value
expend30 = nmax*500 #a fixed number - considering fixed nmax
#earnings of single day
earning = [demand30[i]*100 for i in range(len(demand30))]
#Profit for each day in 30 days:
profit30 = [earning[i] - expend30 for i in range(len(demand30))]
print(profit30[0:5])
print(np.mean(profit30))
total = sum(profit30)
total
[-7340.82129506988, 192442.68739738176, 223275.84840798413, 288577.9484974765, -370.0758018420893]
99997.11902163333
2999913.570649
#Plotting profit for 30 days:
x = [i for i in range(1,31)]
# x = day_data2["dteday"] #label with dteday for day_Data2
y = profit30
plt.plot(x,y)
plt.xlabel("Days")
plt.ylabel("Profit")
plt.title(label="Profit for 30 Days")
plt.show()
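One caveat in the simulation above: adding a residual can push demand below zero, and demand beyond the fleet's capacity of 12 rentals per bike cannot actually be served, so revenue is overstated on peak days. A hedged sketch that clips both, under the same assumed economics (100 per rental, 500 per bike per day; simulate_profit is my own helper):

```python
import numpy as np

def simulate_profit(demand, n_bikes, revenue_per_rental=100,
                    cost_per_bike_day=500, rentals_per_bike=12):
    """Daily profit with demand clipped to [0, fleet capacity]."""
    demand = np.asarray(demand, dtype=float)
    served = np.clip(demand, 0, n_bikes * rentals_per_bike)  # capacity cap
    return served * revenue_per_rental - n_bikes * cost_per_bike_day

# toy check: 2 bikes serve at most 24 rentals per day
profit = simulate_profit([10, 30, -5], n_bikes=2)
print(profit.tolist())  # [0.0, 1400.0, -1000.0]
```

On the actual simulation this would be simulate_profit(demand30, n_bikes=nmax).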
Use grid search along the number of available bikes to find the optimal number of bikes to obtain highest profit (revenue - cost) from simulations.
n_estimators = [ int(x) for x in np.linspace(start=10,stop=80,num=10)]
max_features = [ "auto","sqrt"]
max_depth= [2,4]
# min_sample_split= [2,4]
min_samples_leaf = [1,2]
bootstrap = [True,False]
#Creating param grid
param_grid = {"n_estimators":n_estimators,
"max_features" : max_features,
"max_depth" : max_depth,
# "min_sample_split" : min_sample_split,
"min_samples_leaf" : min_samples_leaf,
"bootstrap" : bootstrap}
print(param_grid)
model_grid = RandomForestRegressor()
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator= model_grid,param_grid=param_grid,cv=3,verbose=2,n_jobs=4)
grid_search.fit(X_train,y_train)
{'n_estimators': [10, 17, 25, 33, 41, 48, 56, 64, 72, 80], 'max_features': ['auto', 'sqrt'], 'max_depth': [2, 4], 'min_samples_leaf': [1, 2], 'bootstrap': [True, False]}
Fitting 3 folds for each of 160 candidates, totalling 480 fits
GridSearchCV(cv=3, estimator=RandomForestRegressor(), n_jobs=4,
param_grid={'bootstrap': [True, False], 'max_depth': [2, 4],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1, 2],
'n_estimators': [10, 17, 25, 33, 41, 48, 56, 64, 72,
80]},
verbose=2)
grid_search.best_params_
{'bootstrap': False,
'max_depth': 4,
'max_features': 'sqrt',
'min_samples_leaf': 1,
'n_estimators': 41}
# Actual class predictions
grid_train = grid_search.predict(X_train)
grid = grid_search.predict(X_test)
print("RMSE Train",math.sqrt(mean_squared_error(y_train, grid_train)))
print("RMSE Test",math.sqrt(mean_squared_error(y_test, grid)),"\n")
print("MSE train",mean_squared_error(y_train, grid_train))
print("MSE test",mean_squared_error(y_test, grid))
print("MAE train",mean_absolute_error(y_train,grid_train))
print("MAE test",mean_absolute_error(y_test,grid))
RMSE Train 665.3544366201504
RMSE Test 863.2243989672205

MSE train 442696.5263301178
MSE test 745156.362972319
MAE train 505.1924466994856
MAE test 608.3985754179388
#Predicting demand for 30 days:
X30 = day_data2.drop(["cnt","casual","registered","dteday"],axis=1)
y30 = day_data2["cnt"]
y30 = y30.to_list()
pred30_grid = grid_search.predict(X30)
print("predicted: ",pred30_grid[0:5])
print("actual: ",y30[0:5])
demand30 = [(new_dist30[i] + pred30_grid[i]) for i in range(len(pred30))] #demand after adding random shock
nmax_grid = max([math.ceil(pred30_grid[i]/12) for i in range(len(y30))])
print("optimal bikes:nmax grid" ,nmax_grid)
predicted:  [4227.81347393 5909.23128737 6095.79988335 5682.82537438 3896.0213052 ]
actual:  [4649, 6234, 6606, 5729, 5375]
optimal bikes:nmax grid 508
#Profit for 30 days:
#expenditure for single day: Fixed value
expend30 = nmax_grid*500 #a fixed number - considering fixed nmax
#earnings of single day
earning = [demand30[i]*100 for i in range(len(demand30))]
#Profit for each day in 30 days:
profit30 = [earning[i] - expend30 for i in range(len(demand30))]
print(profit30[0:5])
print(np.mean(profit30))
total = sum(profit30)
total
[109685.72609787696, 275561.91613445303, 294983.5367429276, 340934.88593569363, 45100.654717836704]
182163.95222729334
5464918.566818801
When I assumed a rent of 100 per ride and a cost of 500 per bike per day, the average profit over the 30 days came to about 100,000 Rs per day, with some days negative and some positive.
Overall 30-day profit with the optimal number of bikes (grid search): 5,464,918 (about 54 lakh for 30 days).
Hence, our profit improved when using the optimal number of bikes found via grid search.
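The grid search above tunes the regressor's hyperparameters; the task statement also asks for a grid search along the number of available bikes. A minimal sketch under the same assumed economics (revenue 100 per rental, cost 500 per bike-day; best_fleet_size is my own helper), applicable to any simulated daily demand such as demand30:

```python
import numpy as np

def best_fleet_size(demand, candidates, revenue=100, cost=500, rentals_per_bike=12):
    """Return (n_bikes, total_profit) maximizing simulated profit over the grid."""
    demand = np.clip(np.asarray(demand, dtype=float), 0, None)
    best = None
    for n in candidates:
        served = np.minimum(demand, n * rentals_per_bike)  # capacity cap
        profit = float((served * revenue - n * cost).sum())
        if best is None or profit > best[1]:
            best = (n, profit)
    return best

# toy check: with demand of 24 rentals/day, 2 bikes (capacity 24) are optimal
print(best_fleet_size([24, 24, 24], candidates=[1, 2, 3]))  # (2, 4200.0)
```

On the actual simulation one could call best_fleet_size(demand30, candidates=range(nmax + 1)) to obtain the profit-maximizing fleet size.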
Tasks: (Optional) Please share with us any free form reflection, comments or feedback you have in the context of this test task.
Task 6:
Please submit this notebook with your developments in .ipynb and .html formats as well as your requirements.txt file.
[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.